Harness Engineering
AI coding agents get judged on model names, benchmarks, and release notes. That misses the layer that turns a model into a system you can trust with files, shells, browsers, credentials, and long-running work. The harness is the runtime between the model and the machine. It assembles context, exposes tools, arbitrates permissions, records state, routes work across agents, and gives humans a way to review or roll back side effects.
The harness operates as the decisive application layer around the model. It governs what capability enters the session, what the model sees, what can execute, what gets remembered, and what undergoes human review.
Harness As Control Plane
A harness acts as the runtime control plane that coordinates models doing work with side effects. It receives a user request, builds the prompt state, calls a model, routes tool calls, applies permission policy, feeds tool results back into the model, records state, and verifies the output.
user / IDE / channel
-> harness session
-> context assembler
-> model call
-> tool router
-> sandbox / permission plane
-> tools, external servers, shell, browser, files
-> verifier / review / tests
-> state, memory, checkpoints, logs
The critical boundary sits between model intent and side effect. When a model only writes text, a bad answer stops at the transcript. When a model edits files, sends network requests, runs migrations, or drives a browser, the harness must turn language into scoped action. The harness owns the registry the model sees, the permission checks before execution, and the trust decision after a result comes back. Product quality diverges from raw model capability here.
Context Assembly
An agent prompt represents assembled state, not just the user's last message. The context window includes conversation history, file contents, command outputs, explicit instructions, local memory, loaded skills, and system prompts (Codex AGENTS.md). When context fills, the harness compacts it by clearing older tool outputs or summarizing past conversation (Claude Code, How Claude Code works).
context = user request
+ durable instructions
+ relevant project files
+ compact tool registry
+ selected memories
+ recent tool results
+ policy and system context
Context assembly operates as a filter. The harness chooses which files to read fully, which results to summarize, which tools to reveal, which memories apply, and which outputs came from untrusted text. Wide context helps the model see the system. Noisy context pushes the model to follow stale rules, pick wrong API versions, or treat hostile document text as instruction. System-curated context improves model quality more than dumping raw text into the token window.
Permissions
The permission plane translates "the model wants to do this" into "the system allows this action under these conditions." A production harness enforces this through permission modes or profiles. A standard default posture lets an agent read and edit workspace files and run routine local commands, while requiring explicit human approval for internet access or external directory changes (Codex permissions).
Staged approval flows are architectural controls, not cosmetic UX. Modes that restrict agents to planning, exploration, or read-only operations separate read phases from write phases and surface the blast radius before granting write access (Claude Code, Permission modes). Network requests often require separate approval loops, and web fetching runs in isolated context windows to mitigate prompt injection (Claude Code, Security).
Tool Arbitration
Tool arbitration happens before a tool runs. Execution hooks intercept calls before execution or when an approval dialog appears (Codex hooks, Claude Code, Hooks). These hooks map namespaces, compress descriptions for the model, filter tools by session policy, and block out-of-scope calls.
Dynamic tool registries—where external servers notify clients of new or removed capabilities—require a live integration surface. The harness reacts to shifting tools and maintains boundary enforcement. Pushing side-effect policy into prose instructions ("do not call dangerous tools") is a design failure. Natural-language instructions share a channel with untrusted data; deterministic policy checks must sit outside the decoder in the arbitration layer.
State And Rollback
Agent state extends past the chat history. The harness maintains checkpoints, recording messages, tool uses, and results. When a model edits files, the harness snapshots the working tree to enable rewinds (Claude Code, How Claude Code works).
The harness does not own truth alone. Version control systems like Git own the working tree, while the harness provides a diff interface (Codex app review). Gateway-style assistants face a broader state surface: rollback has to account for queues, channel delivery, device trust, and audit events (OpenClaw Gateway architecture). Checkpoints usually cannot undo external side effects like database writes, API calls, or deployments, demanding idempotency keys or explicit reversal routines.
Memory
Memory prevents an agent from relearning identical project facts, but it should not enforce policy. Memories are localized files or database entries derived from useful prior threads, with secrets redacted and background updates applied (Codex Memories). They load into the context window to guide behavior, not to serve as strict rules (Claude Code, Memory).
Stale memory and memory injection are the primary failure modes. A year-old note about a test command sends the agent down the wrong path. A hostile web page or tool output becomes future context if the harness records it blindly. Useful memory requires provenance, timestamps, repository scope, review, and deletion paths. It makes the next turn easier without turning past context into rigid law.
Multi-Agent Orchestration
Parallel agents alter the runtime topology. A harness orchestrates specialized concurrent subagents—explorers, reviewers, workers—operating in isolated context windows (Codex subagents). They return summaries to the main conversation, preventing deep work from bloating the primary context. Sandbox environments manage code execution and file persistency across these interactions (Gemini API Agents overview).
Multi-agent systems shift operational risk toward audit. One agent reads logs, another edits code, another runs tests, and another writes the plan. The harness preserves provenance, prevents race conditions where agents edit the same file simultaneously, and makes the parent session responsible for integration. Without coordination, parallelism yields motion over control.
Beyond Workspace Harnesses
While coding agents dominate the discussion, harnesses dictate architecture across other environments. Persistent assistant systems and verifier-backed solvers require entirely different control planes compared to repository-bound coding agents.
For persistent assistant harnesses, the control plane does not live inside a Git repository. It functions as a distributed gateway that centers on channel routing, device identity, mandatory handshakes, and long-lived session state (OpenClaw Gateway architecture). The harness here prioritizes protocol typing and pairing approval over file diffs.
Formal verification harnesses focus on rigorous state validation. For mathematical reasoning, a harness might interleave informal language model reasoning with formally verified proof steps using an external checker like Lean (HERMES paper). Here, the harness controls proof continuity and verifies state transitions, proving that the control plane's architecture must adapt tightly to the target environment's validation rules.
Evaluation Harnesses
Agent evaluation measures the runtime, not only text generation. Evaluation harnesses connect models to sandboxed environments, passing real issues, codebases, test scripts, and oracle solutions to verify success (SWE-bench GitHub repository, Terminal-Bench GitHub repository).
These benchmarks exercise harness decisions: did the agent find the right files, run the right command, preserve state, handle tool errors, and stop when evidence weakened? A strong model fails if the harness hides relevant files or blocks necessary commands. A smaller model succeeds on narrow tasks if the harness provides correct context, tool scope, and validation paths. Private evaluation harnesses test the policy layer alongside the language layer using representative repositories, allowed commands, seeded failures, network constraints, and rollback checks.
Failure Modes
Harness failures frequently masquerade as correct commands run in the wrong context.
| Failure | Mechanism | Mitigation |
|---|---|---|
| Context noise | Old tool output, stale memory, web text, or irrelevant docs enter the prompt | Source labels, context budgets, deferred tool definitions, separate web-fetch context |
| Overbroad permissions | The model receives write, network, shell, or host-tool access beyond the task | Sandbox modes, permission profiles, allowlists, approval flows, execution hooks |
| Prompt injection | Untrusted text steers privileged actions through the language channel | Treat remote content as data, verify tool trust, separate context, enforce policy in code |
| False rollback | File checkpoints work, but API calls, deployments, messages, or database writes leaked | Dry runs, plan mode, staged approval, idempotency keys, audit logs |
| Stale memory | The agent treats old project notes as current facts | Timestamps, repo scope, memory review, delete and edit paths |
| Multi-agent drift | A child session returns a summary without provenance for integration | Scoped subagents, shared diff review, parent-session verification |
Relying solely on transcripts hides these failures. Real evidence lives in logs, permissions, diffs, tool history, test output, and external side effects.
My Take
The durable advantage in agent engineering stems from the harness over model selection. Models raise the ceiling, but the harness dictates whether capability survives contact with a real repository, browser, credentials, and review process.
Engineering teams must treat agent policy like continuous integration configuration. They will version permission profiles, hooks, tool allowlists, memory rules, evaluation tasks, and rollback playbooks alongside source code. Prompt text matters, but it cannot carry the safety and operations burden alone.
References
- Claude Code, How Claude Code works
- Claude Code, Permission modes
- Claude Code, Security
- Claude Code, Memory
- Claude Code, Hooks
- OpenAI Codex, Permissions
- OpenAI Codex, Hooks
- OpenAI Codex, AGENTS.md
- OpenAI Codex, Subagents
- OpenAI Codex, Memories
- OpenAI Codex app, Review
- Google, Gemini API Agents overview
- OpenClaw Gateway architecture
- SWE-bench GitHub repository
- Terminal-Bench GitHub repository
- HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs
author: Arii tag: #ai, #architecture links: [[Three Layers — Tool, MCP, Skill]], [[Jailbreaking LLMs]], [[Small LLMs — Use Cases and Limits]], [[Multi-Token Prediction]]
